| Due date | Turn in | Points |
|---|---|---|
| 11:45pm Mon Aug 4 | Assignment 1 Repo on GitHub has been created by due date, and commit history shows steady effort on analysis. | 3 |
| 11:45pm Mon Aug 18 | Final solutions available on repo. | 12 |
ETC5521 Diving Deeper into Data Exploration: Assignment 1
As per Monash’s integrity rules, these solutions are not to be shared beyond this class.
🎯 Goal
The assignment is designed to assess whether your knowledge of data wrangling and GitHub is at a level that allows you to successfully follow the content of this class. The assignment represents 15% of your final grade for ETC5521. This is an individual assignment.
📌 Guidelines
Accept the GitHub Classroom Assignment provided in Moodle using a GitHub Classroom compatible web browser. This should generate a private GitHub repository that can be found at https://github.com/etc5521-2025. Your GitHub assignment 1 repo should contain the files assign01.html, README.md, assign01-submission.qmd, assignment.css, etc5521-assignment1.Rproj and .gitignore.
Answer each question in the assign01-submission.qmd in the repo.
For the final submission, knit assign01-submission.qmd, which will contain your answers. Make sure to provide the link to the script of any Generative AI conversation you employed in arriving at your solution. Note that marks are allocated for overall grammar and structure of your final report.
Leave all of your files in your GitHub repo for marking. We will check your git commit history. You should have contributions to the repo with consistent commits over time. (Note: nothing needs to be submitted to Moodle.)
You are expected to develop your solutions by yourself, without discussing any details with other class members or other friends or contacts. You can ask for clarifications from the teaching team and we encourage you to attend consultations to get assistance as needed. As a Monash student you are expected to adhere to Monash’s academic integrity policy and to the rules on use of Generative AI as detailed on this unit’s Moodle assessment overview. Failure to adhere to this policy may result in a ZERO for this assignment, followed by an academic integrity breach report. The chief examiner reserves the right to question you about any part of your solution.
The primary sources for methods needed for this assignment are R for Data Science (2e) and [Text Mining with R: A Tidy Approach](https://www.tidytextmining.com).
We expect this assignment to take about 10 hours to complete. You should work on this analysis steadily over the time period between release and due date. Spend a couple of hours soon after the assignment is released getting started, and several hours in each of the following two weeks to refine your analysis. Your GitHub commit history should reflect this working pattern.
Deadlines are listed in the table at the top of this page.
Marks
| Part | Points |
|---|---|
| GitHub Repo | 3 |
| Q1-Q4 each worth | 3 |
| References and AI | -3 |
| Formatting, spelling, grammar, code and reproducibility | -3 |
Appropriate use of GitHub is an important collaborative analysis skill, and demonstrating this counts towards marks.
Spelling and grammar mistakes and a lack of tidy formatting detract from the score because they make the report harder to read and harder to mark. We expect to be able to reproduce your report, so all code needs to be included, and the code needs to be readable.
Using GAI well is an emerging skill. Inadequate use or over-use without fully processing the responses detracts from a data analysis.
🛠️ Exercises
The data to use is available in the orcas R package on GitHub. The package can be installed using
remotes::install_github("jadeynryan/orcas")
to access the data. The descriptions of the variables can be found here.
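A minimal sketch of checking the installation, assuming the remotes package is used as above. It lists the data sets shipped with the package instead of assuming a particular data set name; the variable names `orcas_installed` and `available` are illustrative only.

```r
# Install once (requires internet access); uncomment to run
# install.packages("remotes")
# remotes::install_github("jadeynryan/orcas")

# Check whether the package is available in this R library
orcas_installed <- requireNamespace("orcas", quietly = TRUE)

if (orcas_installed) {
  # List the names of the data sets bundled with the package
  available <- data(package = "orcas")$results[, "Item"]
  print(available)
} else {
  message("The orcas package is not installed yet; run the install lines above.")
}
```

Once the data set name is known, `help()` on that name should open the variable descriptions, provided the package documents its data.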
Why are we interested in orca encounters? Whale watching is a major tourism business in many parts of the world. Monitoring the population of whales is important for the sustainability of many businesses and for the health of the planet. Beyond this, orcas are cool! They are intelligent, social, beautiful and one of the top predators in the ocean.
An interesting fact is that orcas routinely helped whalers hunt baleen whales in Twofold Bay near Eden, Australia, in the 1800s. They would herd the whales into the bay for the whalers. When a whale was killed, the whalers rewarded the orcas with the whale’s tongue and lips. Actually, this is quite gruesome 😳 😬.
Question 1
Summarise the temporal patterns in this data. For example,
- what is the time frame of the data collection?
- are there any seasonal patterns in the measurements?
- what is the usual length of the encounters?
Question 2
Summarise the spatial patterns, for example
- where are these encounters happening?
- are the boats following the whales based on the tracks of the encounters?
Question 3
Summarise the encounters by the vessels and observers. Are they the same whales that are frequently seen?
- Are there some especially frequent observers or active vessels?
- Are the long encounters made by special vessels?
- Do some vessels make multiple encounters?
Question 4
Each encounter has a text description. Summarise the common words used for the encounters.
Resources
In this part, you should cite major resources used, including R packages, and actively discuss how generative AI helped with your answers to the assignment questions, and where or how it was mistaken or misleading.
You need to provide links to the full script of your conversations with generative AI tools. You should not use a paid service, as the freely available systems will be sufficiently helpful.
For example, the citation() function in R can give R package details:
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686.
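An entry like the one above can be produced along the following lines. This is a sketch only: the tidyverse call is guarded because that package may not be installed on every machine, and `r_cite` is an illustrative variable name.

```r
# Citation for R itself; available in any R session
r_cite <- citation()
print(r_cite, style = "text")

# Citation for an add-on package, only if it is installed
if (requireNamespace("tidyverse", quietly = TRUE)) {
  print(citation("tidyverse"), style = "text")
}

# The same entry in BibTeX form, e.g. for a .bib file used by the report
print(toBibtex(r_cite))
```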
The links to my use of ChatGPT for help on this assignment are:
- https://chatgpt.com/share/68857a22-debc-8001-bfcc-10552befba4c helped to construct the regex to process the duration variable.
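As an illustration of the kind of regex processing mentioned in that example, the sketch below extracts a number of seconds from duration strings. The string format shown is an assumption made for illustration, not a statement about how the duration variable is actually stored, and the names `duration_raw`, `pattern` and `duration_secs` are hypothetical.

```r
# Hypothetical duration strings; the real variable may be formatted differently
duration_raw <- c("1935s (~32.25 minutes)", "6300s (~1.75 hours)", NA, "unknown")

# Capture the leading (optionally negative) integer that precedes "s"
pattern <- "^[[:space:]]*(-?[0-9]+)s.*$"

# Flag which entries match the assumed format (NA and malformed text do not)
matched <- grepl(pattern, duration_raw)

# Convert only the matching entries; everything else stays NA
duration_secs <- rep(NA_real_, length(duration_raw))
duration_secs[matched] <- as.numeric(sub(pattern, "\\1", duration_raw[matched]))

# Express the duration in minutes for easier reading
duration_mins <- duration_secs / 60
print(data.frame(duration_raw, duration_secs, duration_mins))
```

Converting only the matched entries keeps malformed values as `NA` without triggering coercion warnings, which makes it easier to count and report how many values could not be parsed.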
Rubric
To help you complete your report, the rubric below sets out what we are expecting:
| content | Excellent (HD) | Very good (D) | Good (C) | Satisfactory (P) | Unsatisfactory (F) |
|---|---|---|---|---|---|
| Q1-4 | Plots and summaries are comprehensive and concise. Summaries are designed well and polished. Text is used to summarise the tables and plots. Anything interesting or problematic with the data is discussed. | Plots and summaries are complete and concise. Summaries are appropriate. Text is added to summarise the tables and plots. Interesting or problematic aspects of the data are reported. | Plots and summaries mostly answer questions raised by the initial expectations, and provide good answers and reasoning. | Plots and summaries are not complete or there are too many summaries provided. Summaries are mostly appropriate. Text is added to summarise the tables and plots. | Plots and tables are not well matched to what is needed, and text explanations are not provided. Plots are not readable. |
| Repo | Actively committing during entire assignment period, with informative commit messages, done after each small change to the work. Committing changes during most of assignment period, with clear evolution of the data analysis. (3) | Repo not accepted in time, and fewer than 11 commits (2) | Repo not accepted in time, and fewer than 5 commits (1) | Repo not accepted in time, and a single commit (0) | |
| GAI | GAI used effectively, deeply, and explained, script linked to report (0 deduction) | GAI used but not explained well or inadequate and script linked to report (-0.5) | Shallow use of GAI, script linked to report (-1) | Clearly used but no script linked to report (-3) | |
| Reproducibility | No changes needed for report to reproduce exactly as provided. Code is nicely formatted, commented and readable using appropriate tidyverse standards. (0 deduction) | Single change needed for report to reproduce exactly as provided. Code is nicely formatted, commented and readable using appropriate tidyverse standards. (0 deduction) | Just a few changes needed for report to reproduce exactly as provided. Code is formatted, commented and readable using appropriate tidyverse standards. (-0.5) | Multiple changes needed for report to reproduce exactly as provided. Code is readable. (-1) | Cannot easily make changes for report to reproduce at all. Code is not readable. (-2) |
| Spelling/Grammar | Writing style is exceptional, scholarly and succinct, and is free from spelling, grammar and punctuation errors. (0 deduction) | Writing style is scholarly, free from spelling, grammar and punctuation errors. (0 deduction) | Writing style is scholarly, but wordy. Free from spelling, grammar and punctuation errors. (-0.5) | Writing is scholarly and wordy. Contains some grammatical, punctuation and spelling errors. (-1) | Writing is unscholarly. Many grammatical, punctuation and spelling errors. (-2) |
| References | The appropriate referencing style has been used consistently, with no errors. Includes citations for software used, and data sources. (0 deduction) | The appropriate referencing style has been used consistently, with very few errors, and includes software used, and data sources. (0 deduction) | The appropriate referencing style has been used consistently, and only a few citations missing. (-0.5) | The appropriate referencing style has been used much of the time, missing some major sources that were clearly used. (-1) | Material used from external sources without citation. (-2) |